TensorFlow Classifiers

In this article, we demonstrate the implementation of a TensorFlow Linear Classifier model through an example. Details about the dataset can be found in the Diagnostic Wisconsin Breast Cancer Database.

Features with high variance

High variance in some features can hurt the modeling process. For this reason, we standardize the features by removing the mean and scaling them to unit variance.
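As a minimal NumPy sketch of this transform (scikit-learn's `StandardScaler` implements the same operation), standardization subtracts each column's mean and divides by its standard deviation; the feature values here are illustrative, not from the dataset:

```python
import numpy as np

# Toy feature matrix: 4 samples x 2 features with very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Standardize: remove the mean and scale to unit variance, per column.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std
```

After the transform, every column has mean 0 and standard deviation 1, so no single feature dominates purely because of its scale.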

Train and Test sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
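The stratification property can be checked directly with scikit-learn's `StratifiedKFold`; the data below is a toy imbalanced set, chosen only to make the preserved class ratio easy to see:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 samples, 15 of class 0 and 5 of class 1 (a 3:1 ratio).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves the 3:1 class ratio of the complete set.
    print(np.bincount(y[test_idx]))  # [3 1] in every fold
```

With plain `KFold` on the same data, an unlucky fold could contain no positive samples at all; stratification rules that out.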

Feature Columns

Create the feature columns, using the original numeric columns as is and one-hot-encoding categorical variables.
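The article's feature-column code is not shown in this excerpt; as a dependency-light sketch of the underlying idea (in TensorFlow this is done with `tf.feature_column.numeric_column` and an indicator column over a vocabulary), the column names and values below are hypothetical:

```python
import numpy as np

# Hypothetical columns: 'radius' is numeric, 'grade' is categorical.
vocabulary = ['low', 'medium', 'high']
grades = ['high', 'low', 'medium', 'high']

# Numeric columns pass through as-is.
radius = np.array([14.2, 11.8, 13.5, 17.9]).reshape(-1, 1)

# One-hot-encode the categorical column against its vocabulary.
grade_onehot = np.zeros((len(grades), len(vocabulary)))
for i, g in enumerate(grades):
    grade_onehot[i, vocabulary.index(g)] = 1.0

# Final design matrix: 1 numeric column + 3 one-hot columns.
features = np.hstack([radius, grade_onehot])
```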

Input Function

The input function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. Concretely, it is a function that returns a tf.data.Dataset object which outputs the following two-element tuple:
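The contract can be sketched with a plain-Python generator; the real input function would build a `tf.data.Dataset` (e.g. via `tf.data.Dataset.from_tensor_slices`) from the same tensors, and the feature names below are hypothetical:

```python
import numpy as np

def make_input_fn(features, labels, batch_size=2, shuffle=True, seed=0):
    """Return an input function that streams (features_dict, labels) batches.

    A stand-in for the tf.data version: the real input_fn would return a
    tf.data.Dataset yielding the same two-element tuples.
    """
    def input_fn():
        idx = np.arange(len(labels))
        if shuffle:
            np.random.default_rng(seed).shuffle(idx)
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Two-element tuple: dict of feature columns, label vector.
            yield ({k: v[batch] for k, v in features.items()}, labels[batch])
    return input_fn

# Hypothetical feature columns for illustration.
features = {'mean_radius': np.array([14.2, 11.8, 13.5, 17.9]),
            'mean_texture': np.array([20.1, 18.4, 21.3, 10.4])}
labels = np.array([0, 1, 0, 1])

first_features, first_labels = next(iter(make_input_fn(features, labels)()))
```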

Building the input pipeline

Modeling: Boosted Trees Classifier
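The article's model is TensorFlow's boosted trees estimator, whose code is not shown in this excerpt (and the `tf.estimator` API has since been deprecated). As a sketch of the same boosted-trees idea, here is scikit-learn's `GradientBoostingClassifier`, a stand-in rather than the article's exact model, on the same Wisconsin breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Diagnostic Wisconsin Breast Cancer data, as used in the article.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Boosted trees: an ensemble of shallow trees, each fit to the
# residual errors of the ensemble built so far.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 learning_rate=0.1, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```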

ROC Curves
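An ROC curve plots the true positive rate against the false positive rate as the decision threshold sweeps over the predicted scores; a minimal sketch with scikit-learn, using toy labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy ground truth and predicted probabilities (illustrative values).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per threshold; auc summarizes the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
# plt.plot(fpr, tpr) would draw the curve; here roc_auc is 0.75.
```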

Confusion Matrix

The confusion matrix allows us to visualize the performance of an algorithm. Note that, due to the size of the data, we do not provide a cross-validation evaluation here, although in general that type of evaluation is preferred.
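A minimal sketch of computing the matrix with scikit-learn, on toy labels; rows are true classes, columns are predicted classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative values).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Layout for binary labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```

Off-diagonal entries are the misclassifications, so a perfect classifier produces a purely diagonal matrix.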

Train in Memory

An alternative way to speed up training of a boosted trees model is the train_in_memory utility. However, if performance is not an issue and long training time is not a concern, training without this feature is recommended [2]. Furthermore, our observations show that train_in_memory does not always improve training performance.

ROC Curves

Feature Importance

We can investigate the feature importance of the classification task. The approach is similar to scikit-learn's and has been outlined in [6].
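TensorFlow's boosted-trees estimator exposes gain-based importances; scikit-learn's `feature_importances_` attribute is the analogous quantity, so a hedged sketch of the idea on the same dataset, using the stand-in classifier rather than the article's estimator:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(data.data, data.target)

# Gain-based importances, normalized to sum to 1; sort descending.
order = np.argsort(clf.feature_importances_)[::-1]
top5 = [(data.feature_names[i], clf.feature_importances_[i])
        for i in order[:5]]
```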

A nice property of directional feature contributions (DFCs) is that the sum of the contributions plus the bias equals the prediction for a given example.
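The property is easy to state numerically; the contributions and bias below are illustrative toy values, not outputs of the trained model:

```python
# Illustrative DFCs for one example: each feature's signed contribution
# to the predicted probability, plus the bias (the mean prediction
# over the training set).
bias = 0.63
dfcs = {'mean_radius': +0.12, 'mean_texture': -0.04, 'worst_area': +0.17}

# Sum of contributions + bias recovers the example's prediction.
prediction = bias + sum(dfcs.values())
```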

Feature Importance for a patient

Plot the DFCs for an individual patient, color-coded by the direction of each contribution, and annotate the figure with the feature values.
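A sketch of such a plot with matplotlib; the feature names, values, and contributions are hypothetical, and positive/negative contributions are colored green/red:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend for scripting
import matplotlib.pyplot as plt

# Illustrative DFCs for one patient (hypothetical names and values).
names = np.array(['mean radius', 'mean texture', 'worst area', 'symmetry'])
dfcs = np.array([0.12, -0.04, 0.17, -0.02])
values = np.array([14.2, 20.1, 652.0, 0.18])  # the patient's feature values

order = np.argsort(np.abs(dfcs))  # smallest-magnitude bars at the bottom
colors = np.where(dfcs[order] > 0, 'green', 'red')  # directionality

fig, ax = plt.subplots()
ax.barh(np.arange(len(dfcs)), dfcs[order], color=colors)
ax.set_yticks(np.arange(len(dfcs)))
# Annotate each bar's label with the underlying feature value.
ax.set_yticklabels([f'{names[i]} = {values[i]}' for i in order])
ax.set_xlabel('Directional feature contribution')
fig.savefig('dfc_patient.png')
```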


References

  1. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.J., Sandhu, S., Guppy, K.H., Lee, S. and Froelicher, V., 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American journal of cardiology, 64(5), pp.304-310.

  2. Aha, D. and Kibler, D., 1988. Instance-based prediction of heart-disease presence with the Cleveland database. University of California, 3(1), pp.3-2.

  3. Gennari, J.H., Langley, P. and Fisher, D., 1989. Models of incremental concept formation. Artificial intelligence, 40(1-3), pp.11-61.

  4. Regression analysis Wikipedia page
  5. Tensorflow tutorials
  6. TensorFlow Boosted Trees Classifier
  7. Lasso (statistics) Wikipedia page
  8. Tikhonov regularization Wikipedia page
  9. Palczewska A., Palczewski J., Marchese Robinson R., Neagu D. (2014) Interpreting Random Forest Classification Models Using a Feature Contribution Method. In: Bouabana-Tebibel T., Rubin S. (eds) Integration of Reusable Systems. Advances in Intelligent Systems and Computing, vol 263. Springer, Cham
  10. S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Technometrics).
  11. S. Aeberhard, D. Coomans and O. de Vel, “THE CLASSIFICATION PERFORMANCE OF RDA” Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland. (Also submitted to Journal of Chemometrics).